POLAR-7B is a scalar reward model built with large-scale pretraining. It adopts a policy discriminative learning paradigm, which enables it to distinguish effectively between policies and to align with human preferences.
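As a scalar reward model, POLAR-7B maps a (prompt, response) pair to a single score that can be used to rank candidate responses. The sketch below illustrates only that interface; `toy_score` is a hypothetical stand-in heuristic used for illustration, not the actual POLAR-7B forward pass.

```python
# Minimal sketch of the scalar-reward interface: a reward model assigns
# one float per (prompt, response) pair; higher means more preferred.
# `toy_score` is a hypothetical stand-in for a real model's forward pass.

def toy_score(prompt: str, response: str) -> float:
    # Illustrative heuristic only: reward responses whose words overlap
    # with the prompt. A trained reward model learns far richer signals.
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 0.0
    overlap = sum(w in prompt_words for w in response_words)
    return overlap / len(response_words)

def best_of_n(prompt: str, candidates: list[str]) -> str:
    # Best-of-n selection: score every candidate and keep the top one.
    return max(candidates, key=lambda r: toy_score(prompt, r))

prompt = "What is the capital of France?"
candidates = ["Paris is the capital of France.", "I like trains."]
print(best_of_n(prompt, candidates))
```

In practice the same ranking pattern is used for best-of-n sampling and for providing the reward signal in RLHF-style fine-tuning, with the model's scalar output in place of the toy heuristic.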